Kwakye and Phil have been collecting weather data that dates back as far as 1921 to the present day. My task is to search for data online that could be correlated with this weather data of ours.
Our weather data contains the following:
There are two types of possible comparisons with these datasets: Within comparisons, where we compare the data within it’s own dataset (its own variables against one another), or between comparisons, comparing datasets with one another. It’s worth doing some within comparisons first before between comparisons, so that we really understand the data we’re investigating.
The first is a dataframe that contains information about the quality of air in London between 2008-present day, by month.
In order to compare these two dataframes, it’s essential to manipulate them so that they have similar formatting. Our weather data will need to be aggregated by day and month to compare with the london.gov data that’s also by day and month.
Below is the code for the processes of this report, including:
and so on.
### Server connector
weatherConn <- odbcDriverConnect('driver={SQL Server};server=rlibdbwp01.twutil.net;database=weatherDB;trusted_connection=true')
#if we want to get everything from the weather data..
# sqlParam <- "SELECT * FROM daily "
# dailyData <- sqlQuery(weatherConn, sqlParam)
### Gets smd paramaters from the server
sqlParam <- "SELECT * FROM daily WHERE [RecordTypeNumber] = 2 AND [Location] = 'Central London' AND [Date] >= '1 Jan 2008' " #London Air Quality data only goes back as far as 2008. Phil suggested using "record type number 2".
dailyData <- sqlQuery(weatherConn, sqlParam)
#Extracting Month and Years
dailyData$Date <- as.Date(dailyData$Date)
dailyData$Month <- months(dailyData$Date)
dailyData$Year <- format(dailyData$Date,format="%y")
dailyData <- dailyData %>%
mutate(Difference = Maxtemp - Mintemp)
#Averages of the variables
#by month and year type, as this is how the data from gov.uk is laid out.
DailyDataMean <- aggregate(cbind(Maxtemp, Mintemp, Sunshine, Precipitation, Frost) ~ Month + Year, data = dailyData, mean)
#Q: are we interested in frost??
#importing dataframes
#roadside
Air_quality_london_roadside <- read_excel("H:/my office/Gwen Stuff/Projects/Project4 - Weather Data/WeatherDataVSOpenSourceData/Copy of air-quality-london.xls",
sheet = "Monthly_averages_roadside")
#background
Air_quality_london_background <- read_excel("H:/my office/Gwen Stuff/Projects/Project4 - Weather Data/WeatherDataVSOpenSourceData/Copy of air-quality-london.xls",
sheet = "Monthly_averages_background")
#renaming so it's easier to plot
setnames(Air_quality_london_roadside, old = c('Nitric Oxide (ug/m3)','Nitrogen Dioxide (ug/m3)', 'Oxides of Nitrogen (ug/m3)', 'Ozone (ug/m3)', 'PM10 Particulate (ug/m3)', 'PM2.5 Particulate (ug/m3)', 'Sulphur Dioxide (ug/m3)'), new = c('Nitric_Oxide_R','Nitrogen_Dioxide_R', 'Oxides_of_Nitrogen_R', 'Ozone_R', 'PM10_Particulate_R', 'PM2.5_Particulate_R', 'Sulphur_Dioxide_R'))
setnames(Air_quality_london_background, old = c('Nitric Oxide (ug/m3)','Nitrogen Dioxide (ug/m3)', 'Oxides of Nitrogen (ug/m3)', 'Ozone (ug/m3)', 'PM10 Particulate (ug/m3)', 'PM2.5 Particulate (ug/m3)', 'Sulphur Dioxide (ug/m3)'), new = c('Nitric_Oxide_B','Nitrogen_Dioxide_B', 'Oxides_of_Nitrogen_B', 'Ozone_B', 'PM10_Particulate_B', 'PM2.5_Particulate_B', 'Sulphur_Dioxide_B'))
#Extracting Month and Years for roadside air quality
Air_quality_london_roadside$Date <- as.Date(as.yearmon(Air_quality_london_roadside$Month)) #function from the library package "zoo"
Air_quality_london_roadside$Month <- months(Air_quality_london_roadside$Date)
Air_quality_london_roadside$Year <- format(Air_quality_london_roadside$Date,format="%y")
#Extracting Month and Years for background air quality
Air_quality_london_background$Date <- as.Date(as.yearmon(Air_quality_london_background$Month))
Air_quality_london_background$Month <- months(Air_quality_london_background$Date)
Air_quality_london_background$Year <- format(Air_quality_london_background$Date,format="%y")
####Merging TW Weather data and open source data together----
WeatherDataVsGovUK <- Reduce(function(x, y) merge(x, y, by=c("Month", "Year")), list(DailyDataMean, Air_quality_london_background, Air_quality_london_roadside))
# OzoneVSTemp<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
# geom_line(aes(y = Ozone_B, colour = "ozone Background levels (ug/m3)"))+
# geom_line(aes(y = Ozone_R, colour = "ozone Roadside levels (ug/m3)"))+
# geom_line(aes(y = Mintemp, colour = "min temperature (°C)")) +
# geom_line(aes(y = Maxtemp, colour = "max temperature (°C)")) +
# labs(x = "Date", y = "Temp&Ozone")+
# theme(legend.position="none")
# ggtitle("Ozone Levels (gov.uk) VS Temp")
tmp1 <- scale_colour_manual("Legend", values = c(ozone_B = "red", ozone_R = "#990000", min_temp = "blue", max_temp = "purple"))
tmp2 <- scale_colour_manual("Legend", values = c(Sunshine = "#FFCC33", ozone_B = "red", ozone_R = "#990000"))
tmp3 <- scale_colour_manual("Legend", values = c(Nitrogen_Dioxide_B = "#006600", Nitrogen_Dioxide_R = "green", min_temp = "blue", max_temp = "purple"))
tmp4 <- scale_colour_manual("Legend", values = c(Sunshine = "#FFCC33", Nitrogen_Dioxide_B = "#006600", Nitrogen_Dioxide_R = "green"))
tmp5 <- scale_colour_manual("Legend", values = c(Sulphur_Dioxide_B = "#00CCCC", Sulphur_Dioxide_R = "#FF66CC", min_temp = "blue", max_temp = "purple"))
tmp6 <- scale_colour_manual("Legend", values = c(Sunshine = "#FFCC33", Sulphur_Dioxide_B = "#00CCCC", Sulphur_Dioxide_R = "#FF66CC"))
OzoneVSTemp<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Ozone_B, colour = "ozone_B"))+
geom_line(aes(y = Ozone_R, colour = "ozone_R"))+
geom_line(aes(y = Mintemp, colour = "min_temp")) +
geom_line(aes(y = Maxtemp, colour = "max_temp")) +
labs(x = "Date", y = "Ozone&Temp")+
ggtitle("Ozone levels (gov.uk) VS Sunshine (TW)")+
tmp1
OzoneVSSunshine <- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Ozone_B, colour = "ozone_B"))+
geom_line(aes(y = Ozone_R, colour = "ozone_R"))+
geom_line(aes(y = Sunshine, colour = "Sunshine")) +
labs(x = "Date", y = "Ozone&Sunshine")+
ggtitle("Ozone Levels (gov.uk) VS Sunshine (TW)")+
tmp2
NDVSTemp<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Nitrogen_Dioxide_B, colour = "Nitrogen_Dioxide_B"))+
geom_line(aes(y = Nitrogen_Dioxide_R, colour = "Nitrogen_Dioxide_R"))+
geom_line(aes(y = Mintemp, colour = "min_temp")) +
geom_line(aes(y = Maxtemp, colour = "max_temp")) +
labs(x = "Date", y = "Nitrogen Dioxide&Temp")+
ggtitle("Nitrogen Dioxide Levels (gov.uk) VS Temperature (TW)")+
tmp3
NDVSSunshine<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Nitrogen_Dioxide_B, colour = "Nitrogen_Dioxide_B"))+
geom_line(aes(y = Nitrogen_Dioxide_R, colour = "Nitrogen_Dioxide_R"))+
geom_line(aes(y = Sunshine, colour = "Sunshine")) +
labs(x = "Date", y = "Nitrogen Dioxide Sunshine")+
ggtitle("Nitrogen Dioxide Levels (gov.uk) VS Sunshine (TW)")+
tmp4
SDVSTemp<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Sulphur_Dioxide_B, colour = "Sulphur_Dioxide_B"))+
geom_line(aes(y = Sulphur_Dioxide_R, colour = "Sulphur_Dioxide_R"))+
geom_line(aes(y = Mintemp, colour = "min_temp")) +
geom_line(aes(y = Maxtemp, colour = "max_temp")) +
labs(x = "Date", y = "Sulhur Dioxide&Temp")+
ggtitle("Sulphur Dioxide Levels (gov.uk) VS Temperature (TW)")+
tmp5
SDVSSunshine<- ggplot(WeatherDataVsGovUK, aes(x = Date.x, label = Date.x))+
geom_line(aes(y = Sulphur_Dioxide_B, colour = "Sulphur_Dioxide_B"))+
geom_line(aes(y = Sulphur_Dioxide_R, colour = "Sulphur_Dioxide_R"))+
geom_line(aes(y = Sunshine, colour = "Sunshine")) +
labs(x = "Date", y = "Sulphur Dioxide&Sunshine")+
ggtitle("Sulphur Dioxide Levels (gov.uk) VS Sunshine (TW)")+
tmp6
p <- ggplotly(subplot(OzoneVSTemp, OzoneVSSunshine, NDVSTemp, NDVSSunshine, SDVSTemp, SDVSSunshine, nrows = 3, margin = 0.05, titleY = TRUE)) %>%
layout(showlegend = FALSE, title = "Air Quality Data (gov.uk) against temperature & Sunshine (TW)")
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(p)
Here are some line graphs of the variables we’re most likely interested in between our weather data against London.gov.uk’s data on air quality. We see these variables ploted against time: how they all relate as time passes.
The x-axis is date, from 2008-2018, as the air quality data only goes as far back as 2008. The data is monthly, meaning that each point that we hover on is the average for that month. For example, if we hover over the top right hand plot (Ozone levels VS. Temperature), over the minimum line and see the first point for temperature (01/01/2008) is 6.19 degrees celsius, what we mean is that the average temperature across all the minimum temperatures for each day in January 2008 is 6.19 degrees as a mean estimate. Likewise, the average temperature across all the maximum temperatures in January 2008 is 11.08 degrees.
Another point to make: because we are plotting all our measurements monthly, if we refer to “anywhere between month a and month b, we only refer to singular monthly points between them. I.e between January and March would be the 3 points: the average of January, average of February and the average for March.
The y-axis is “value”, with no units. This is so we can plot different variables with different units on the same plots indiscriminately. The label of the y-axis is simply the two fields we’re plotting: 1 from london.gov.uk, the other from our weather data. For each field from the air quality data come 2 variables: data from the roadside and data from the background.
For example,
There are also variations of Temperature from our weather data:
London.gov.uk Quality Data Variables:
Thames Water Weather Data Variables:
For all these variables, they peak & reach their lowest points only once (occasionally twice) a year.
ggplotly(OzoneVSTemp)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Ozone and temperature rise and fall similarly between seasons. When ozone is rising, temperature is rising and when ozone is falling, so is temperature. It seems that they are correlated to one another, and that one is influencing the other. This is what we predict for our data, as outside research suggests they do indeed have a relationship.
What I noticed here though is that there is a short delay in the peak of temperature compared with the ozone peak. When ozone is at it’s peak, its not until a few months later until temperature reaches its peak. There’s also a short delay in the lowest temperature point compared with the lowest levels of ozone. These delays are between 1-3 months.
Temperature also peaks at the same time each year: July/occassionally August
Temperature is at its lowest between December-February.
ggplotly(OzoneVSSunshine)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Now we look into sunshine against ozone. Sunshine also rises and falls with ozone levels throughout the year, consistently from 2008-2018. It also has a delay in peak compared with ozone, as did temperature. Maybe because temperature delayed from the previous plot is why sunshine did, because temperature and sunshine are related? We still can’t be too sure! But this is my guess: that a peak & drop delay in sunshine compared with ozone’s levels may be related to temperature’s delaying pattern.
Sunshine usually peaks in June/July, but not always the case. Between 2008-18, it peaks anywhere between April and August.
Sunshine is at it’s lowest anywhere between November-January.
ggplotly(NDVSTemp)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Next we introduce Nitrogen Dioxide, first plotted against temperature.
Nitrogen Dioxide follows a similar fluctuating pattern just like ozone, temperature and sunshine that we’ve seen previously, fluctuating between seasons and years. However, there is a noticeable difference: When temperature peaks, Nitrogen Dioxide levels are at their lowest. Nitrogen Dioxide patterns are almost the complete opposite to what was seen with Ozone.
Furthermore, there doesn’t seem to be a consistent delay of peak or drop for either temperature of Nitrogen dioxide. Some years, temperature will peak sooner than Nitrogen Dioxide reaches its lowest point (e.g. 2008). Other years, Nitrogen Dioxide reaches its lowest point before temperature peaks (e.g. 2012). This may be explained by the more eratic behaviour of the changes in Nitrogen Dioxide (more frequent fluctuations) compared with ozone levels.
Temperature peaks at the same time each year: July/occassionally August
Temperature is at its lowest between December-February. (winter)
Another thing I notice from this plot is that the difference between roadside & background levels of Nitrogen Dioxide is greater than the difference between roadside & background levels of Ozone.
ggplotly(NDVSSunshine)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Not much more can be said about this plot than the previous one. Sunshine levels peak approximately when Nitrogen Dioxide levels are at their lowest.
ggplotly(SDVSTemp)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Sulhur Dioxide fluctuates much more abruptly than the other variables that we’ve seen: not such obious dips and peaks as the others. It’s not very clear, but sulphur dioxide levels tends to peak when temperature is at its lowest and vice versa, like Nitrogen Dioxide.Lowest sulphur dioxide 0.04 (2dp) on 10/2015. Highest at 7.61 on 01/2017 where the value is the same figure for temperature.. the lines meet.
ggplotly(SDVSSunshine)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Due to the closeup on the scaling, it’s even more obvious the ferocity of the Sulphur Dioxide fluctuations, when we see them plotted with sunshine. Sulphur Dioxide levels do peak when sunshine is at its lowest, and vice versa.
We can also do correlation plots for this data! I have decided to do Spearman correlations for this data, as I am interested in the monotonicity of the data between the two variables than the linearity.
drops1 <- c("Month","Year", "Date.x", "Date.y")
corr1 <- cor(WeatherDataVsGovUK[ , !(names(WeatherDataVsGovUK) %in% drops1)], use="pairwise.complete.obs", method=c("spearman"))
p<-ggcorrplot(corr1, hc.order = TRUE, type = "lower",
outline.col = "black",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))+
ggtitle("Correlation of variables between Air Quality Data and TW Weather Data in London, 2008-2018")
ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Here, all of the variables from our two datasets are correlated against one another. Red colours imply a positive correlation between two variables, whereas blue colours imply a negative correlation. The stronger the colour, the stronger the correlation. The strength in colour to indicate colour is universal practice among most correlation plots of this type/this package.
The plot is of type “lower”, which means that the plot excludes repitions for the same correlation and misses out correlations for the same two variables.
The first thing that catches my eye is the whole block of blue, for temperature and sunshine are negatively correlated with almost every field from the Air Quality data apart from ozone.
From our Weather data, ozone is most correlated with sunshine (0.71), indicating higher ozone levels on sunnier days.
Oxides of Nitrogen seem to be most strongly (negatively) correlated with our weather data.. whereas there’s not much of a correlation with sulphur dioxide. Precipitation (rain..) has hardly any relationship with our London’s air quality data, which is surprising…
drop_Roadside <- c('Month','Year', 'Date.x', 'Date.y', 'Nitric_Oxide_R','Nitrogen_Dioxide_R', 'Oxides_of_Nitrogen_R', 'Ozone_R', 'PM10_Particulate_R', 'PM2.5_Particulate_R', 'Sulphur_Dioxide_R')
corr_Roadside <- cor(WeatherDataVsGovUK[ , !(names(WeatherDataVsGovUK) %in% drop_Roadside)], use="pairwise.complete.obs", method=c("spearman"))
r<-ggcorrplot(corr_Roadside, hc.order = TRUE, type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))+
ggtitle("Correlation of variables between Air Quality Data (Roadside) and TW Weather Data in London, 2008-2018")
ggplotly(r)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
drop_Background <- c('Month','Year', 'Date.x', 'Date.y', 'Nitric_Oxide_B','Nitrogen_Dioxide_B', 'Oxides_of_Nitrogen_B', 'Ozone_B', 'PM10_Particulate_B', 'PM2.5_Particulate_B', 'Sulphur_Dioxide_B')
corr_Background <- cor(WeatherDataVsGovUK[ , !(names(WeatherDataVsGovUK) %in% drop_Background)], use="pairwise.complete.obs", method=c("spearman"))
b<-ggcorrplot(corr_Background, hc.order = TRUE, type = "lower",
outline.col = "white",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))+
ggtitle("Correlation of variables between Air Quality Data (Background) and TW Weather Data in London, 2008-2018")
ggplotly(b)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
not much of a difference whether the air quality data was collected from the background of London or at the roadside… similar trends are apparent between our weather data variables.
OzoneB_MaxTemp <- ggplot(WeatherDataVsGovUK, aes(x = Maxtemp, y=Ozone_B))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)+
labs(x = "Maxtemp", y = "Ozone_B")+
theme(legend.position="none")
OzoneB_MinTemp <- ggplot(WeatherDataVsGovUK, aes(x = Mintemp, y=Ozone_B))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)+
labs(x = "Mintemp", y = "Ozone_B")+
theme(legend.position="none")
OzoneR_MaxTemp <- ggplot(WeatherDataVsGovUK, aes(x = Maxtemp, y=Ozone_R))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)+
labs(x = "Maxtemp", y = "Ozone_R")+
theme(legend.position="none")
OzoneR_MinTemp <- ggplot(WeatherDataVsGovUK, aes(x = Mintemp, y=Ozone_R))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)+
labs(x = "Mintemp", y = "Ozone_R")+
theme(legend.position="none")
p <- ggplotly(subplot(OzoneB_MaxTemp, OzoneB_MinTemp, OzoneR_MaxTemp, OzoneR_MinTemp, nrows = 2, margin = 0.05, titleX = TRUE, titleY = TRUE)) %>%
layout(showlegend = FALSE, title = "Solar Levels Against Thames Water Data")
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(p)
These points are the same ones used for the correlations and for the first linegraphs of the document. It would have been better to use daily points for them, but the Air Quality Dataset is only monthly (and hourly..), so this isn’t possible.
Proving our earlier point, there isn’t much of a difference between background and roadside levels when it comes to relationships with our weather data. Ozone and temperature show a positive linear relationship, which is stronger looking at the maximum temperature than the minimum temperature of point. All plots show large levels of spread, though confidence intervals around the regression line are all relatively narrow.
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 3.4.4
## Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
## TMB was built with Matrix version 1.2.14
## Current Matrix version is 1.2.10
## Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
fit1 <- lm(Ozone_B ~ Maxtemp, data=WeatherDataVsGovUK)
fit2 <- lm(Ozone_B ~ Mintemp, data=WeatherDataVsGovUK)
fit3 <- lm(Ozone_B ~ Maxtemp + Mintemp, data=WeatherDataVsGovUK)
fit4 <- lm(Ozone_R ~ Maxtemp, data=WeatherDataVsGovUK)
fit5 <- lm(Ozone_R ~ Mintemp, data=WeatherDataVsGovUK)
fit6 <- lm(Ozone_R ~ Maxtemp + Mintemp, data=WeatherDataVsGovUK)
sjt.lm(fit1,fit2, fit3, file="output.html")# You have to save the table in html format.
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encouter cluttered labelling.
| Â | Â | Ozone_B | Â | Ozone_B | Â | Ozone_B | ||||||
| Â | Â | B | CI | p | Â | B | CI | p | Â | B | CI | p |
| (Intercept) |  | 20.75 | 15.61 – 25.89 | <.001 |  | 28.23 | 24.22 – 32.24 | <.001 |  | 5.21 | -1.39 – 11.81 | .121 |
| Maxtemp |  | 1.01 | 0.70 – 1.32 | <.001 |  |  |  | 4.47 | 3.35 – 5.58 | <.001 | ||
| Mintemp |  |  |  | 0.99 | 0.56 – 1.42 | <.001 |  | -4.58 | -6.01 – -3.15 | <.001 | ||
| Observations | Â | 123 | Â | 123 | Â | 123 | ||||||
| R2 / adj. R2 | Â | .253 / .247 | Â | .146 / .139 | Â | .440 / .431 | ||||||
sjt.lm(fit4,fit5, fit6, file="output.html")
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encouter cluttered labelling.
| Â | Â | Ozone_R | Â | Ozone_R | Â | Ozone_R | ||||||
| Â | Â | B | CI | p | Â | B | CI | p | Â | B | CI | p |
| (Intercept) |  | 17.25 | 13.16 – 21.34 | <.001 |  | 22.04 | 18.92 – 25.16 | <.001 |  | 6.77 | 1.29 – 12.26 | .016 |
| Maxtemp |  | 0.63 | 0.38 – 0.88 | <.001 |  |  |  | 2.96 | 2.04 – 3.89 | <.001 | ||
| Mintemp |  |  |  | 0.61 | 0.27 – 0.94 | <.001 |  | -3.09 | -4.28 – -1.90 | <.001 | ||
| Observations | Â | 123 | Â | 123 | Â | 123 | ||||||
| R2 / adj. R2 | Â | .173 / .167 | Â | .096 / .088 | Â | .322 / .311 | ||||||
These are quite interesting. Among the minimum temperature points, for 1 degree rise in temperature, background ozone levels go up by 0.99 (less than y=x), but among the maximum temperature of points, for each 1 degree rise in temperature, background ozone levels go up by 1.01 (more than y=x). Generally, then, as temperature rises by 1 degree, ozone levels increase by approximately 1 degree (between 0.99 and 1.01). However, differentiating intercepts between the maximum temperature regression line & the minimum temperature regression line influence this difference.
Looking at these results from both the London roadside data and the background data for ozone levels, we can reject the hypothesis of no linear relationship between temperature levels and ozone. There is signifiant evidence to suggest that the variation in ozone influenced by temperature is not due to chance.(less than 1% probability due to chance)
The 2nd set of data is a dataframe that contains information about the energy generated in varies parts of London & the south west between 27th July 2013 - 19th November 2014.It’s another dataframe that has come from the London Datastore site.
Code:
### Server connector
weatherConn <- odbcDriverConnect('driver={SQL Server};server=rlibdbwp01.twutil.net;database=weatherDB;trusted_connection=true')
#if we want to get everything from the weather data..
# sqlParam <- "SELECT * FROM daily "
# dailyData <- sqlQuery(weatherConn, sqlParam)
### Gets smd paramaters from the server
sqlParam <- "SELECT * FROM daily WHERE [RecordTypeNumber] = 2 AND [Location] = 'Central London' AND ([Date] BETWEEN '25 Nov 2013' AND '30 Nov 2014')" #London Air Quality data only goes back as far as 2008. Phil suggested using "record type number 2".
dailyData <- sqlQuery(weatherConn, sqlParam)
#Extracting Month and Years
dailyData <- dailyData %>%
mutate(Date = as.Date(dailyData$Date), Month = months(dailyData$Date), Year = format(dailyData$Date, format="%y"))
#Averages of the variables
DailyDataMean <- aggregate(cbind(Maxtemp, Mintemp, Sunshine, Precipitation, Frost) ~ Month + Year, data = dailyData, mean)
#by month and year type, as this is how the data from gov.uk is laid out.
#Loading Solar Data----
Solar_Data <- read_excel("H:/my office/Gwen Stuff/Projects/Project4 - Weather Data/WeatherDataVSOpenSourceData/Copy of Weather Data 2014-11-30.xlsx",
sheet = "by_Date_R")
Solar_Data$Date <- as.Date(Solar_Data$Date)
Solar_Data <- setnames(Solar_Data, old = c('Easthill Road', 'Elm Crescent', 'Forest Road', 'Maple Drive East', 'Grand Total'), new = c('Easthill_Road', 'Elm_Crescent', 'Forest_Road', 'Maple_Drive_East', 'Grand_Total_Solar'))
#Plots----
WeatherDataVsGovUK <- dailyData %>%
inner_join(Solar_Data)
## Joining, by = "Date"
Here are some plots of these variables (Solar from gov.uk, rest from TW) plotted against date (time series data)
library(methods)
Solar<- ggplot(WeatherDataVsGovUK, aes(x = Date, label = Date))+
geom_line(aes(y = Grand_Total_Solar, colour = "Total Solar Levels Across Regions"))+
scale_color_manual(values=c("#FF6600"))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Temp <- ggplot(WeatherDataVsGovUK, aes(x = Date, label = Date))+
geom_line(aes(y = Mintemp, colour = "Min temperature (°C)")) +
geom_line(aes(y = Maxtemp, colour = "Max temperature (°C)")) +
scale_color_manual(values=c("purple", "blue"))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Sunshine <- ggplot(WeatherDataVsGovUK, aes(x = Date, label = Date))+
geom_line(aes(y = Sunshine, colour = "Sunshine"))+
scale_color_manual(values=c("#FFCC33"))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Rain <- ggplot(WeatherDataVsGovUK, aes(x = Date, label = Date))+
geom_line(aes(y = Precipitation, colour = "Rain"))+
scale_color_manual(values=c("#6699CC"))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p <- ggplotly(subplot(Solar, Temp, Sunshine, Rain, nrows = 2, margin = 0.05))%>%
layout(title = "TW Weather data VS. London.gov.uk Solar Data")
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(p)
Here are some line graphs of the variables we’re most likely interested in between our weather data against London.gov.uk’s data on Solar energy generation levels. We see these variables ploted against time: how they all relate as time passes.
The x-axis is date, from 25/11/2013-30/11/2014. This time round we have daily data from our online source, so we can look at our data more closely and more accurately as we don’t need to do average levels for our weather data.
The y-axis is still “value” with no units. The difference this time is that each variable is in a different plot, rather than overlaying eachother like in the first subplot of this document. This is because as the data is fluctuating more frequently from the measurements being taken daily, it would be too messy to put multiple variables in the same plot. So, we can simply compare each of them between the subplots.
London.gov.uk Quality Data Variables:
Thames Water Weather Data Variables:
For all these variables, they peak & reach their lowest points only once (occasionally twice) a year.
Solar and sunshine levels follow similar trends across the year. When levels of solar energy dip down in the middle of the year at 3 points, sunshine levels dip on exactly the same day. These days are:
As sunshine dips on exactly the same days, it definitely shows a sign of comparative ability between the datasets.
On the other hand, sunshine levels fluctuate and vary much more than solar energy levels do. Sunshine levels can vary massively from one day to the next, but less of that is seen with solar energy levels. Maybe it’s worth asking:
What is the probability that the difference in solar from one day to the next is greater than X? What is the probability that the difference in sunshine levels from one day to the next is greater than X?
The difference between Solar levels and temperature is more interesting. There seems to be a delay in the highest temperatures of the year (June-August) compared with the highest solar energy generation levels of the year (April-June).
corr1 <- cor(WeatherDataVsGovUK[c("Maxtemp", "Mintemp", "Sunshine", "Precipitation", "Grand_Total_Solar")],use="pairwise.complete.obs", method=c("spearman"))
p<-ggcorrplot(corr1, hc.order = TRUE, type = "lower",
outline.col = "black",
ggtheme = ggplot2::theme_gray,
colors = c("#6D9EC1", "white", "#E46726"))+
ggtitle("Solar Data VS Weather Data")
ggplotly(p)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
A relatively strong positive correlation between maximum temperature and solar & sunshine and solar. Perhaps surprisingly, there is a stronger correlation between max. temperature and solar compared with sunshine and solar, when using the Spearman’s correlation method. Meaning, solar levels rise more when temperature rises than when sunshine levels rise.
Solar levels have a negative correlation with precipitaition, which is to be expected.
Here are plots of the data from TW’s weather data plotted against Solar levels (gov.uk) and their corresponding statistics.
Seeing these fits in action….
SolarVSMaxtemp <- ggplot(WeatherDataVsGovUK, aes(x = Maxtemp, y=Grand_Total_Solar))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)
SolarVSMintemp <- ggplot(WeatherDataVsGovUK, aes(x = Mintemp, y=Grand_Total_Solar))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)
SolarVSunshine <- ggplot(WeatherDataVsGovUK, aes(x = Sunshine, y=Grand_Total_Solar))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)
SolarVSRain <- ggplot(WeatherDataVsGovUK, aes(x = Precipitation, y=Grand_Total_Solar))+
geom_point()+
geom_smooth(method='lm', method.args = list(family = "symmetric"), formula=y~x)
p <- ggplotly(subplot(SolarVSMaxtemp, SolarVSMintemp, SolarVSunshine, SolarVSRain, nrows = 2, margin = 0.05, titleX = TRUE)) %>%
layout(showlegend = FALSE, title = "Solar Levels Against Thames Water Data")
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(p)
Here we learn why the correlation coefficient for the correlation between minimum temperature and Solar levels isn’t as great as the correlation coefficient for maximum temperature and solar levels… Minimum temperature behavious is more eratic and rising away from, not along with, solar levels.
The scatter plot between precipitation and levels of solar energy is very peculiar. The confidence interval becomes very wide after the 15.47 point, as the outliers behave differently from the rest of the data.
library(sjPlot)
fit1 <- lm(Grand_Total_Solar ~ Maxtemp, data=WeatherDataVsGovUK)
fit2 <- lm(Grand_Total_Solar ~ Mintemp, data=WeatherDataVsGovUK)
fit3 <- lm(Grand_Total_Solar ~ Maxtemp + Mintemp, data=WeatherDataVsGovUK)
fit4 <- lm(Grand_Total_Solar ~ Sunshine, data=WeatherDataVsGovUK)
fit5<- lm(Grand_Total_Solar ~ Precipitation, data=WeatherDataVsGovUK)
sjt.lm(fit1,fit2, fit3, file="output.html")
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encouter cluttered labelling.
| Â | Â | Grand_Total_Solar | Â | Grand_Total_Solar | Â | Grand_Total_Solar | ||||||
| Â | Â | B | CI | p | Â | B | CI | p | Â | B | CI | p |
| (Intercept) |  | -364.43 | -521.14 – -207.73 | <.001 |  | 235.88 | 105.88 – 365.88 | <.001 |  | -514.57 | -674.97 – -354.17 | <.001 |
| Maxtemp |  | 69.78 | 60.80 – 78.76 | <.001 |  |  |  | 112.23 | 94.62 – 129.83 | <.001 | ||
| Mintemp |  |  |  | 61.53 | 48.57 – 74.50 | <.001 |  | -61.21 | -83.32 – -39.11 | <.001 | ||
| Observations | Â | 371 | Â | 371 | Â | 371 | ||||||
| R2 / adj. R2 | Â | .387 / .386 | Â | .191 / .189 | Â | .433 / .430 | ||||||
sjt.lm(fit4,fit5, file="output.html")
## Fitted models have different coefficients. Grouping may not work properly. Set `group.pred = FALSE` if you encouter cluttered labelling.
| Â | Â | Grand_Total_Solar | Â | Grand_Total_Solar | ||||
| Â | Â | B | CI | p | Â | B | CI | p |
| (Intercept) |  | 217.31 | 125.63 – 308.99 | <.001 |  | 856.27 | 786.62 – 925.92 | <.001 |
| Sunshine |  | 113.90 | 98.64 – 129.16 | <.001 |  |  | ||
| Precipitation |  |  |  | -28.01 | -42.41 – -13.61 | <.001 | ||
| Observations | Â | 371 | Â | 371 | ||||
| R2 / adj. R2 | Â | .369 / .367 | Â | .038 / .036 | ||||
### Server connector
weatherConn <- odbcDriverConnect('driver={SQL Server};server=rlibdbwp01.twutil.net;database=weatherDB;trusted_connection=true')
#if we want to get everything from the weather data..
# sqlParam <- "SELECT * FROM daily "
# dailyData <- sqlQuery(weatherConn, sqlParam)
### Gets smd paramaters from the server
sqlParam <- "SELECT * FROM daily WHERE [RecordTypeNumber] = 2 AND [Location] = 'Central London' AND [Date] >= '1 Jan 2008' " #London Air Quality data only goes back as far as 2008. Phil suggested using "record type number 2".
dailyData <- sqlQuery(weatherConn, sqlParam)
#Extracting Month and Years
dailyData$Date <- as.Date(dailyData$Date)
dailyData$Month <- months(dailyData$Date)
dailyData$Year <- format(dailyData$Date,format="%y")
dailyData <- dailyData %>%
mutate(Difference = Maxtemp - Mintemp)
#Averages of the variables
#by month and year type, as this is how the data from gov.uk is laid out.
DailyDataMean <- aggregate(cbind(Maxtemp, Mintemp, Sunshine, Precipitation, Frost) ~ Month + Year, data = dailyData, mean)
#Q: are we interested in frost??
insight1 <- ggplot(dailyData, aes(x=Month, y=Difference))+
geom_boxplot()+
stat_summary(fun.y=mean, geom="line", aes(group=1))+
stat_summary(fun.y=mean, geom="point")
ggplotly(insight1)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Here is a boxplot showing the range of differences between the Maximum temperature and the minimum temperatures of each day collected between 2008 and 2018 for each month of the year.
Leo has introduced us to the package prophet: a forecasting procedure created by Facebook Open Source for time series data. It can be used in R and python.
More on Python can be found here: https://facebook.github.io/prophet/
Lets see how it can be implemented in our data.
### Server connector
weatherConn <- odbcDriverConnect('driver={SQL Server};server=rlibdbwp01.twutil.net;database=weatherDB;trusted_connection=true')
#if we want to get everything from the weather data..
# sqlParam <- "SELECT * FROM daily "
# dailyData <- sqlQuery(weatherConn, sqlParam)
### Gets smd paramaters from the server
sqlParam <- "SELECT * FROM daily WHERE [RecordTypeNumber] = 2 AND [Location] = 'Central London' AND [Date] >= '1 Jan 2008' " #London Air Quality data only goes back as far as 2008. Phil suggested using "record type number 2".
dailyData <- sqlQuery(weatherConn, sqlParam)
#Extracting Month and Years
dailyData$Date <- as.Date(dailyData$Date)
dailyData$Month <- months(dailyData$Date)
dailyData$Year <- format(dailyData$Date,format="%y")
dailyData <- dailyData %>%
mutate(Difference = Maxtemp - Mintemp)
#Averages of the variables
#by month and year type, as this is how the data from gov.uk is laid out.
DailyDataMean <- aggregate(cbind(Maxtemp, Mintemp, Sunshine, Precipitation, Frost) ~ Month + Year, data = dailyData, mean)
#Max temp data
MaxTempData <- dailyData %>%
dplyr::select(Date, Maxtemp) %>%
plyr::rename(c('Maxtemp'='y', 'Date' = 'ds'))
MaxTempProphet <- prophet(MaxTempData)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -155.325
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
MaxTempFuture <- make_future_dataframe(MaxTempProphet, periods = 365)
MaxTempForecast <- predict(MaxTempProphet, MaxTempFuture)
#min temp
MinTempData <- dailyData %>%
dplyr::select(Date, Mintemp) %>%
plyr::rename(c('Mintemp'='y', 'Date' = 'ds'))
MinTempProphet <- prophet(MinTempData)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -225.712
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
MinTempFuture <- make_future_dataframe(MinTempProphet, periods = 365)
MinTempForecast <- predict(MinTempProphet, MinTempFuture)
#The difference!
DifferenceMinMaxData <- dailyData %>%
dplyr::select(Date, Difference) %>%
plyr::rename(c('Difference'='y', 'Date' = 'ds'))
DifferenceMinMaxProphet <- prophet(DifferenceMinMaxData)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -99.0907
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
DifferenceMinMaxFuture <- make_future_dataframe(DifferenceMinMaxProphet, periods = 365)
DifferenceMinMaxForecast <- predict(DifferenceMinMaxProphet, DifferenceMinMaxFuture)
#Q: are we interested in frost??
#max temp
ggplotly(plot(MaxTempProphet, MaxTempForecast))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
prophet_plot_components(MaxTempProphet, MaxTempForecast, weekly_start = 1)
#mintemp
ggplotly(plot(MinTempProphet, MinTempForecast))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
prophet_plot_components(MinTempProphet, MinTempForecast, weekly_start = 1)
ggplotly(plot(DifferenceMinMaxProphet, DifferenceMinMaxForecast))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
prophet_plot_components(DifferenceMinMaxProphet, DifferenceMinMaxForecast, weekly_start = 1)
What’s really interesting is that the difference between the minimum and maximum temperatures per day follow a really similar pattern to the normal trends! So, in general, when temperature reaches its peak per year (max temperature and min temperature will be reaching their peaks), then the difference between the two reach their peaks also!
But, this is assuming we know nothing about anything else that could be influencing these trends. What are the main reasons that minimum and maximum temperatures follow in this way? A way to do this is by adding regressors to the forecast, so that we see another trend with the regressors that could be having an impact.
Lets do this now for the Max temperature:
### Server connector
weatherConn <- odbcDriverConnect('driver={SQL Server};server=rlibdbwp01.twutil.net;database=weatherDB;trusted_connection=true')
#if we want to get everything from the weather data..
# sqlParam <- "SELECT * FROM daily "
# dailyData <- sqlQuery(weatherConn, sqlParam)
### Gets smd paramaters from the server
sqlParam <- "SELECT * FROM daily WHERE [RecordTypeNumber] = 2 AND [Location] = 'Central London' AND [Date] >= '1 Jan 2008' " #London Air Quality data only goes back as far as 2008. Phil suggested using "record type number 2".
dailyData <- sqlQuery(weatherConn, sqlParam)
#Extracting Month and Years
dailyData$Date <- as.Date(dailyData$Date)
dailyData$Month <- months(dailyData$Date)
dailyData$Year <- format(dailyData$Date,format="%y")
dailyData <- dailyData %>%
mutate(Difference = Maxtemp - Mintemp)
#Averages of the variables
#by month and year type, as this is how the data from gov.uk is laid out.
DailyDataMean <- aggregate(cbind(Maxtemp, Mintemp, Sunshine, Precipitation, Frost) ~ Month + Year, data = dailyData, mean)
dailyData <- dailyData %>%
plyr::rename(c('Maxtemp'='y', 'Date' = 'ds')) %>%
na.omit()
MaxTempProphet2 <- prophet()
MaxTempProphet2 <- add_regressor(MaxTempProphet2, 'Sunshine')
MaxTempProphet2 <- fit.prophet(MaxTempProphet2, dailyData)
## Disabling daily seasonality. Run prophet with daily.seasonality=TRUE to override this.
## Initial log joint probability = -155.273
## Optimization terminated normally:
## Convergence detected: relative gradient magnitude is below tolerance
MaxTempProphet2$changepoints <- as.Date(MaxTempProphet2$changepoints)
MaxTempFuture2 <- make_future_dataframe(MaxTempProphet2, periods = 365)
MaxTempFuture2$ds <- as.Date(MaxTempFuture2$ds)
MaxTempFuture2 <- MaxTempFuture2 %>%
left_join(dailyData) %>%
na.omit()
## Joining, by = "ds"
MaxTempForecast2 <- predict(MaxTempProphet2, MaxTempFuture2)
ggplotly(plot(MaxTempProphet2, MaxTempForecast2))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
prophet_plot_components(MaxTempProphet2, MaxTempForecast2, weekly_start = 1, plot_cap = FALSE)
This time we can see how sunshine changes throughout time and the affect that has on temperature. During the earlier months of the year (such as December and January), the extra_regressors plot dips down significantly, which means that the levels of sunshine during those months have a negative impact on temperature compared with if there was no sunshine affect at all. In contrast, during the warmer months, the levels oh sunshine have a positive impact on temperature and at most have an effect of 3 degrees extra on temperature.
The line plots of trends, weekly and yearly don’t change that much whether or not we consider sunshine as an additional regressor.